GLDADec: marker-gene guided LDA modeling for bulk gene expression deconvolution

Abstract Inferring cell type proportions from bulk transcriptome data is crucial in immunology and oncology. Here, we introduce guided LDA deconvolution (GLDADec), a bulk deconvolution method that guides topics using cell type-specific marker gene names to estimate topic distributions for each sample. Through benchmarking using blood-derived datasets, we demonstrate its high estimation performance and robustness. Moreover, we apply GLDADec to heterogeneous tissue bulk data and perform comprehensive cell type analysis in a data-driven manner. We show that GLDADec outperforms existing methods in estimation performance and evaluate its biological interpretability by examining enrichment of biological processes for topics. Finally, we apply GLDADec to The Cancer Genome Atlas tumor samples, enabling subtype stratification and survival analysis based on estimated cell type proportions, thus proving its practical utility in clinical settings. This approach, utilizing marker gene names as partial prior information, can be applied to various scenarios for bulk data deconvolution. GLDADec is available as an open-source Python package at https://github.com/mizuno-group/GLDADec.


Introduction
Quantification of the proportion of cell types in a tissue sample and understanding the contribution of individual cell types to the physiological state, such as immune responses associated with perturbation or evaluation of cancer tumor samples with cell proliferation, is of utmost importance [1,2].Flow cytometry, a typical experimental approach for quantifying compositional proportions of cell types, is widely used in molecular biology and immunology.However, its application is limited to fresh organs, human tissue specimens are challenging to analyse, and there is limited knowledge on the aggregation of f low cytometry data, making it difficult to utilize legacy data.
The advancement of high-throughput sequencing technology has led to an abundance of stored transcriptome data [3,4].The bulk transcriptome measures the accumulation of gene expression levels derived from various cell types and can be applied for extensive analyses using several well-established databases [3][4][5].Although databases for single-cell sequencing technology, a recent innovation, have also been developed, its high cost and sparce nature make it challenging to conduct large-scale data analyses [6].Hence, it is useful to establish a method to estimate the proportion of constitutive cells from the bulk transcriptome.
Deconvolution is a computational method that can be used to estimate the proportion of immune cells in a sample using transcriptome data.In recent years, several deconvolution methods have been proposed to infer cell type proportions from bulk expression data [7][8][9][10][11][12][13][14].These methods can be categorized into two main groups: reference-free methods and reference-based methods [15,16] Reference-free methods estimate cell type proportions solely based on the samples to be analyzed, making them less sensitive to external information that may cause confounding factors.This approach is reasonable for tissue data where the constituent cell types are not clearly defined, as the number of components can be estimated using likelihood and other factors [11,12].However, discerning the inferred components and their associated cells poses a challenge, particularly in more detailed cell types, thereby rendering the downstream task less interpretable.In contrast, reference-based methods utilize cell type-specific gene expression profiles, called references, as prior information.Although there have been some notable successes [8,13,14], the performance of these methods depends on the quality of the reference data and the batch-to-batch differences between the data to be analyzed.As a result, reference-based methods are only applicable in specific scenarios where the major cellular components are well-defined and appropriate reference data are available.While they are effective for simulated datasets or samples with well-defined constituent cell types, such as blood, they may underestimate the impact of gene expression profiles derived from cells that are not assumed as the reference [17].
Latent Dirichlet Allocation (LDA) has been developed in the context of natural language processing, and it is widely used in various domains, such as text embedding or semantic extraction from documents [18,19].The LDA model is designed to identify the topics that make up the content of documents, which is analogous to deconvolution, a method that extracts cell typespecific information from bulk transcriptome data.However, since LDA is an unsupervised learning method, it is classified as reference-free when simply applied to deconvolution, which presents a challenge in terms of cell identifiability.To address this issue, several approaches have been developed by incorporating prior information into the LDA algorithm and extending it to semisupervised learning [12,20].While these concepts are logical, they hinge on expression levels acquired from pure cell lines or singlecell RNA-Seq as prior knowledge, which is vulnerable to technical biases and imposes restrictive assumptions regarding distribution disparities from target bulk samples.
Here, we propose a novel guided LDA deconvolution (GLDADec) method, which utilizes marker gene names as partial prior information to estimate cell type proportions, thereby overcoming the challenges of conventional reference-based and reference-free methods simultaneously.This method employs a semi-supervised learning algorithm that combines cell-type marker genes with additional factors that may inf luence gene expression profiles to achieve a robust estimation of cell type proportions.Moreover, a median selection strategy is used to aggregate the outputs and achieve a more accurate estimation.We benchmarked GLDADec against existing methods using blood-derived samples with welldefined constituent cells, and it consistently outperformed the existing methods for multiple datasets.We also applied GLDADec to liver bulk RNA-Seq data from drug-induced liver injury models of mice and rats, demonstrating its usefulness for tissue data analysis.Our model, which considers additional topics, ref lects the biological processes underlying the tissue and provides a robust estimation of the guided target cell types.Additionally, by collecting marker gene names in a data-driven manner and using them as prior information on comprehensive cell types, it is possible to estimate a wider range of cell types that are inaccessible by conventional methods.As a further demonstration, we applied GLDADec to human tumor samples, revealing new insights into cancer subtyping and clinical prognostic stratification.GLDADec is available as an open-source package at https: //github.com/mizuno-group/GLDADec.

Latent Dirichlet allocation algorithm for deconvolution
The graphical model of the generation process is depicted in Fig. 1.Each bulk gene expression sample indexed by m is the accumulation of a mixture of K cell types.It is assumed that the sample-topic distribution θ m and topic-gene distribution ϕ k are generated from the Dirichlet distribution as follows: where α and η are hyperparameters.The latent topic Z m,n and gene G m,n are derived from multinomial distributions Z m,n ∼ Multi (θ m ) and G m,n ∼ Multi ϕ Zm,n , respectively.The joint probability is given by: With respect to the observed gene expression levels, the probability distribution that generated them is approximated through collapsed Gibbs sampling [21].Specifically, the posterior distribution is expressed as follows: where G \m,n is the distribution without counting G m,n = v and Z \m,n is without counting Z m,n = k.Note that n \m,n k,v is the total number of counts allocated to topic k excluding the n th gene v in sample m, and n \m,n m,k is the number of counts in sample m excluding the genes allocated to topic k, respectively.By utilizing these probability distributions, latent topics are sampled, and genes are assigned.The pseudo code for collapsed Gibbs sampling is described in Algorithm1.
The joint distribution of gene set G and topic set Z is given by integrating out θ and ϕ as follows: where The proof of Equation 1.5 is provided in Supplementary Note S3.Then we have the log-likelihood at the sth step of the sampling process: and the convergence during iteration is monitored with this.
Figure 1.Overview of GLDADec. the observed gene expression profiles are considered as bag-of-words.We extend the standard LDA generation process to incorporate semi-supervised learning, where the gene names specific to each cell (topic) serve as partial prior information to guide the process.By running GLDADec, we can obtain θ, which ref lects the cell type proportions in each sample.
The LDA algorithm provides the topic distribution for each sample θ m and the gene distribution for each topic ϕ k as output.In the context of cell-type deconvolution, the topics in the obtained θ m correspond to the cell types that make up the sample.

Marker gene name based guided LDA algorithm
The LDA algorithm is an unsupervised learning algorithm that poses a challenge in identifying the cell type corresponding to a particular topic.To address this issue, guided LDA models have been proposed, which incorporate prior information to guide the topics and specify a direction towards the generation process of the LDA model [22,23].In the context of deconvolution, the marker gene set of the cell assumed as topic k, S k , is used to guide the LDA algorithm as follows: where the general concept of this process is to initialize the topicgene distribution and to link topics to cells and allow identification of specific cell proportions.During the learning process, the contribution of marker genes to the corresponding topic changes.Therefore, for guided marker genes, it is also possible to probabilistically avoid updating Z m,n using a parameter.

Selection of genes for analysis
Consider a measured bulk gene expression matrix Y ∈ R L×M for L types of genes across M samples.In our proposed algorithm, collapsed Gibbs sampling is repeated N times, the sum frequency of L types of gene expression in all M samples.We excluded genes with that were outliers >2σ from the log-normal distribution of gene expression levels.This procedure targets genes with consistently high expression levels observed in RNA-seq, notably mitochondrial and ribosomal genes.These genes are not informative and rather can introduce noise when modeling gene expression levels derived from changes in cell proportions.Furthermore, the transcriptome variation may also arise from simple expression change, such as those caused by perturbations.Therefore, in addition to the marker genes given as prior information, topn genes with large coefficient of variation among samples were selected for analysis.The hyperparameter topn was set to a default value of 100 for the analysis of blood samples and 1000 for more heterogeneous tissue samples.See Supplementary Notes S1 and 'Hyperparameter sensitivity analysis' section for details.

Additional topics for tissue data analysis
Tissues are composed of a diverse range of cell populations.In addition to the guided topics that allow for cell identification, we also observe the effects of other topics, such as unexpected cell types or confounding between experiments.To account for these factors, we assume that K topics are composed of the K g guided topics of interest and the K u additional unguided topics with no prior information.
To determine the optimal number of additional topics K u , we propose a recursive algorithm that tests gene contribution to each topic.After adding K u unguided topics and performing deconvolution, we obtain ϕ k∈{1,2,...,K} ∈ R K×L , which is the contribution of L types of non-marker genes to each additional topic.For each additional topic, we extract the top L /K genes with highest contribution value and finally pool N g genes.Then, we obtain Pearson correlation matrix P ∈ R Ku×K for the contribution profiles of the N g genes across the added K u topics and overall K topics.When an element of P is significantly positive, the added K u topics are considered to contain redundancy.The value of K u is increased until this redundancy is detected, and the final K u is determined.

Strategy for robust estimation
The proportions of cell types in a sample are represented by the sample-topic distribution θ m (m = 1, • • • , M). Due to the dependence of the Gibbs sampling process on random variables, a single trial may not be sufficient for accurate estimation.To address this, we employed the median selection strategy, which integrates θ m from multiple random seeds and ensures that the cell type proportions sum to 1.

Functional analysis for the additional topics
We can obtain K u × En topic-gene distributions for unknown topics by assembling the output of the models considering K u additional topics En times.K-means clustering was employed to determine the contribution of L genes to each potential additional topic.Subsequently, enrichment analyses of the top contributing L/K u genes with Gene Ontology (GO) [24] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway [25] were performed utilizing the Fisher's exact test.We also performed single-sample gene set enrichment analysis on the gene contribution values.

Data preparation
To assess the performance of GLDADec in estimating gene expression levels in blood and tissue samples, we selected datasets that included corresponding bulk transcriptome and immune cell proportions determined by f low cytometry.Additionally, we obtained human clinical data for practical real-world data application.GLDADec takes a gene expression matrix Y ∈ R L×M for L genes across M samples as input and utilizes non-log linear scale data derived from the bulk transcriptome.A pseudo-bulk dataset derived from single-cell transcriptomes is also utilized to evaluate the robustness of GLDADec.A summary of the datasets used is provided in Supplementary Table S1.See Supplementary Notes S1 for details on the experimental conditions, including preprocessing of each dataset.

Comparisons with other deconvolution methods
We compared GLDADec with seven competing bulk reference deconvolution methods, including FARDEEP, EPIC, CIBERSORT, DCQ, NNLS, RLR, and Elastic Net, utilizing their default settings where applicable [7][8][9][10].Note that the input matrix for each method was normalized separately.Specifically, a log normalized gene expression matrix was used for DCQ and FARDEEP, while a non-log linear scaled matrix was utilized for the remaining methods.In addition, baseline scores for state-of-the-art methods including GTM-decon, BayesPrism, CIBERSORTx, MuSiC, and BSEQ-sc are reported by Swapna et al. [14,20,26,27].We obtained the deconvolution results from their GitHub repository (https:// github.com/li-lab-mcgill/gtm-decon).See Supplementary Note S2 for further details.

Benchmarking with human blood samples
To begin with, we evaluated our algorithm using three datasets comprising bulk transcriptome data from human blood (GSE-65133, GSE107572, and GSE60424) [9,13,28,29], along with the corresponding proportions of immune cell types determined through f low cytometry.The marker gene names specific to each cell type were defined using domain knowledge, and 100 genes with large coefficients of variation (CV) among samples were included for analysis (as outlined in Supplementary File S1).Our proposed method, which utilizes marker gene names as prior information, achieved highly accurate predictions (Pearson correlations ranging from 0.39 to 0.99), considering each cell type individually (Fig. 2A; Figs S1-3).Since the cell types with known proportions in these datasets are covered by LM22, a signature matrix defined by Newman et al. [13], we compared the proposed GLDADec against seven bulk reference-based methods in using LM22 as a reference, including FARDEEP, EPIC, CIBERSORT, DCQ, NNLS, RLR, and ElasticNet [7][8][9][10], in terms of Pearson correlation and mean squared error (MSE).The proposed method demonstrated a high Pearson correlation across all benchmark datasets, and comparable or superior MSE scores to existing methods on datasets except for GSE60424 (Fig. 2B; Figs.S1-3, S6A).Notably, for the dendritic cells of GSE107572, the estimation performance was significantly improved by our proposed method.The similarity matrix of the estimates across the three datasets revealed that GLDADec shows a relatively similar profile to DCQ (Fig. S7A).The detailed experimental conditions for using existing methods are provided in Supplementary Note S2.To ensure that these estimates were robust and independent of the number of added high CV genes and hyperparameters, a sensitivity analysis was conducted (see 'Hyperparameter sensitivity analysis' section).
We then conducted a benchmarking assessment of GLDADec against state-of-the-art methods, including GTM-decon, Bayes Prism, CIBERSORTx, MuSiC, and BSEQ-sc [9,14,20,26,27].These methods offer more detailed cell type estimations, utilizing single-cell RNA-Seq as prior information.SDY67 and GSE107011 [30,31], which maintain known detailed cell type proportions, were chosen as benchmark datasets, with baseline scores reported from Swapna et al. [20].While GTM-decon and CIBERSORTx exhibited relatively superior performance on SDY67 and GSE107011, respectively, GLDADec demonstrated equivalent or improved performance compared to existing methods across both datasets in terms of Pearson correlation and MSE (Fig. 2C; Fig. S4-5,6B).Analysis of the similarity matrix of the estimates revealed that GLDADec exhibits an independent profile compared to other methods (Fig. S7B).
Taken together, these findings underscore the promise of semi-supervised topic modeling in deconvolution, with GLDADec enhancing estimation performance in real bulk scenarios derived from human blood.

Impact of introducing additional topics
The transcriptome of tissues consists of heterogeneous cell populations, including cells that are not typically expected to be present in the tissue.To account for these unknown cell types, we performed modeling that treated them as additional topics.We investigated two scenarios to assess the utility of introducing additional topics into GLDADec.
Initially, we performed benchmarking on brain tissue data with known cell type proportions, known as ROSMAP [30].All methods exhibited reasonably good performance on this dataset, with similar predictive capabilities.However, predicting endothelial cells and oligodendrocytes consistently posed challenges (Fig. S8).This dataset served to assess whether a cell type excluded from guided topics could be reconstructed by introducing an additional topic.Remarkably, for all cell types except endothelial cells, the added topic proportions exhibited high correlation with the groundtruth proportions of the excluded cell type (Fig. 3A).Furthermore, marker genes of the missing cell type were significantly enriched in the added topic as contributing genes (Fig. 3B).To gain further biological insights into the additional topic, we conducted GO [24] analysis for genes making substantial contributions to the added topic.Notably, the topic reconstructing microglia ref lected the regulation of microglial differentiation and activation, while the topic reconstructing neurons ref lected neurotransmitter transport as a biological process (Table 1; Table S2).
We next validated the impact of introducing additional topics by evaluating immune cell trafficking in tissues with perturbation, using liver bulk RNA-Seq data from various drug-induced liver injury mouse models.We compared modeling with and without additional topics, considering four known cell types (neutrophils, monocytes, NK cells, and Kupffer cells).The optimal number of additional topics increased with the number of genes analyzed for each acetaminophen (APAP) and alpha-naphthyl isothiocyanate treatment group ( Fig. S9A and B).Modeling with additional topics suppressed the variance of the estimates for several cell types, improving the performance of estimating the proportion of each cell type (Fig. 3C; Figs S9C and S10).In particular, the estimation of Kupffer cells with APAP administration was greatly improved (Fig. 3D).To gain further biological insight into the additional topics, we conducted GO and KEGG pathway [25] analysis for the genes with large contribution to each additional topic.The significant biological processes were mainly related to metabolism and biosynthesis, consistent with a major biological   S9D; Table 2; Table S3).
These results indicate that adding topics can improve the performance of guided cell estimation and allows the aggregation of biological functions that exist in the background of the target tissue as unknown topics.

Comprehensive cell type analysis for mouse data
In recent years, databases containing marker genes have become prevalent and easily accessible [32,33].By obtaining a datadriven approach to obtain marker gene names for a diverse array of cell types, we can accurately estimate the proportions of a comprehensive cell type.In this study, we curated marker genes in mouse liver obtained from CellMarker [ 32] and defined marker genes specific to liverrelated cell types (Supplementary File S3).Using these marker genes and applying the proposed method to liver tissue during drug-induced liver injury, we estimated the proportion changes for a wide range of 26 cell types (Fig. 4A; Fig. S11A).Notably, GLDADec allows the estimation of cell types such as hepatocytes and vascular smooth muscle cells (VSMCs), which have rarely been considered in conventional deconvolution methods.Pearson correlation was evaluated for four cell types validated by f low cytometry: neutrophils, monocytes, NK cells, and Kupffer cells.These cell types are widely recognized as playing a significant role in the pathophysiology of drug induced liver injury.As shown in Figs.4B and S11B, the proposed method accurately estimated the change in immune cell proportion with perturbation.In addition, although there were no ground-truths measured by f low cytometry for cell types such as hepatocytes and VSMCs, they showed significant positive or negative correlations with blood biochemistry values such as alanine aminotransferase (ALT) and aspartate aminotransferase (AST) (Tables S4 and S5; Fig. S12).These results suggest that accurate estimates that ref lect individual responses can be achieved for a wide range of cell types.
To assess the robustness of the proposed method to prior knowledge, we altered the marker genes used.Differentially expressed genes (DEGs) between LM13, 13 representative cell types of mouse liver, derived from transcriptome data of each cell type were deemed as marker genes and exhibited consistent superior estimation performance (Fig. S13).Additionally, we evaluated the performance of existing reference-based methods utilizing DEGs, and the proposed method surpassed that of existing methods (illustrated in Fig. 4C; Fig. S11C).

Application to rat data with poor marker information
Owing to the extensive use of rats in toxicology, there is plenty of toxicogenomic data for rats, a valuable resource for data analysis in drug and chemical safety assessments [17,34,35].On the other hand, the cell marker information of rats was much lower than that of the mice.Our previous work showed that when using existing reference-based deconvolution, mouse-derived reference expressions are not extrapolatable, and rat-specific references should be used.However, databases containing gene expression profiles specific to each cell type in the rat are not as abundant as those in the mouse and must be obtained independently, which is costly and time-consuming.
In this section, we evaluate the performance of the proposed method using marker gene names defined in various scenarios.The method of defining the names of marker genes used to We found that the estimation performance was excellent when either was used as prior information (Fig. 5A).Additionally, the estimation performance of the proposed method using each marker outperformed that of the existing reference-based method using mouse LM6, which is a set of representative immune cell types widely used in deconvolution (Fig. 5B).These results suggest that the proposed method, which does not depend on the expression level of the reference gene, effectively achieves cell type proportion estimation using a priori information on the marker gene names that are conserved across species.
Furthermore, our analysis of the immune response in rat liver injury using the GLDADec database provides a more extensive insight into the immune response in rat liver injury, utilizing the mouse marker gene names.By obtaining the same marker genes examined in mouse liver injury, we were able to estimate the comprehensive cell type proportions in rat liver injury and compare the profile changes between mice and rats (Fig. S14A and B).Our findings revealed a striking similarity in the changes of cell proportions in mice and rats, such as an increase in neutrophils and monocytes and a decrease in hepatocytes, following the administration of acetaminophen (Figs.5C; Fig. S14C).Additionally, we identified distinct immune cell trafficking patterns, including increased natural killer T cells in mice but decreased in rats (Fig. 5D).These conclusions indicate that our proposed method is a valuable tool for identifying similarities and differences in immune cell trafficking between species through comprehensive estimation of cell type proportions.

Application GLDADec to tumor samples
The tumor tissue is composed of various cell types, including infiltrating immune cells, stromal and vascular cells, and subclonal cancer cells, as well as other cell types [37].To validate our prediction in tumors, we applied GLDADec to 2037 tumor samples from The Cancer Genome Atlas (TCGA) for three tumor types: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), and liver hepatocellular carcinoma (LIHC) [38][39][40][41].We used marker gene names obtained from CellMarker as prior information for each cancer subtype in the background tissues.Our proposed method stratified the cancer subtypes by comprehensively estimating the proportion of various cell types, including tissue-specific cells such as ciliated cells in LUAD samples and hepatocytes in LIHC samples (Fig. 6A).
To assess the refined stratification performance more comprehensively, we examined the differences in estimated cell proportions for each breast cancer subtype, as reported by Brian et al. [42] (Fig. 6B).As anticipated, the proportion of luminal cells was found to be higher in subtypes LumA and LumB (Fig. 6C).Moreover, we observed a distinct pattern of immune cell infiltration, particularly monocytes, B cells, and plasma cells, in the Her2 subtype (Fig. 6D; Fig. S15).This suggests that subtype classification based Next, we examined the relationship between the predicted proportion of cell types and patient survival.Tumor samples were categorized according to the estimated proportion of specific cell types, and Cox proportional hazards regression was employed to determine survival rates between the top and bottom 25% of samples.Our analyses revealed that patients with LUAD exhibiting a high level of B cell infiltration had worse overall survival (P = .012),while those with LIHC demonstrated improved survival (P = 8.13e-03) (Fig. 6E).These findings indicate that B cells contribute differently to the prognosis and treatment of breast and liver cancers, which is consistent with previous reports [43,44].Moreover, our study found that the infiltration of immune cells, such as γ δT and CD4+ T cells, was associated with better clinical outcomes in breast cancer, which supports previous studies [45,46] (Fig. S16).In addition, GLDADec can estimate the proportion of various cell types with marker gene names, and it is expected to provide novel insights.For instance, we reported that the loss of alveolar type II cells may be the cause of the poor prognosis of LUAD (Fig. 6F).

Hyperparameter sensitivity analysis
The benchmark dataset was utilized to carry out sensitivity analyses for the hyperparameters α for sample-topic distribution, η for topic-gene distribution, as well as the number of high CV genes and the number of median selection attempts.It was observed that GLDADec exhibits robustness with respect to α and η under conditions between 0 and 1, which are commonly used in LDA [21,47].Under conditions that allow for more than one, the estimates were found to be less dependent on α and more susceptible to changes in η (Fig. S17A).In terms of the number of high CV genes to add, it was observed that gamma-delta T cells (γ δT) exhibited better estimation performance when a higher number of genes were added; however, this effect was limited for many other cell types (Fig. S17B).Regarding the number of median selection attempts, there was little effect on the human blood-derived benchmark dataset, while more accurate estimation performance was obtained by increasing the number of trials in the analysis on perturbed mouse and rat tissues ( Fig. S17C and D).

Evaluation of robustness using pseudo bulk dataset
Given that topic modeling for a limited number of documents remains a challenging task [48,49], we examined the inf luence of sample size on the estimation performance of GLDADec.To simulate tissue data for varying sample sizes, pseudo lung bulk data was generated from single-cell RNA-Seq data.GLDADec was applied to 100 randomly generated samples, demonstrating its capability to yield highly accurate deconvolution predictions for both immune cells and tissue background cells (Fig. S18A).When the number of samples was reduced to 10 or 5, estimation performance slightly decreased, and variance increased for certain cell types, such as alveolar type I cells and cytotoxic T cells.Nonetheless, the overall estimation performance remained high across immune cell subtypes (Fig. S18B).
We further verified the robustness of GLDADec by ensuring its independence from differences in the distribution of expression levels between the reference and the bulk samples under analysis.This was achieved by partitioning the dataset used for pseudo bulk creation into main and test sets without any information leaks, and utilizing the reference defined from the test cells.Consequently, high estimation performance was attained, even with conventional methods such as ElasticNet and RLR (Fig. S19A).Conversely, when lung single-cell RNA-Seq data from a distinct dataset was utilized as a reference [50], these reference-based methods consistently exhibited inferior performance (Fig. S19B).
Overall, we demonstrated the robustness of our proposed method to sample size and prior information through analyses conducted on pseudo bulk data.

Discussion
We introduce GLDADec, a marker gene name-based deconvolution approach, to estimate the cell type proportions present in a heterogeneous sample.In the process of inferring the topic distribution of each sample, we utilize marker gene names specific to each cell type as partial prior information to guide the cell name to the topic.Additionally, we can detect and incorporate unknown potential topics that characterize the sample, alongside the guided topics, simultaneously.The estimated topic distribution can be utilized to calculate the proportion of cell types from the bulk RNA-Seq dataset.
A notable advantage of GLDADec is the ease of accessing the name of the marker gene used as prior information.Although gene expression profiles of immune cells can typically be obtained in humans and mice, this is not always feasible for species such as rats that are used in limited areas or for tissue-specific parenchymal cells [35].As a result, reference-based methods are dependent on the amount of prior information available for the scope of their analysis.However, marker gene names are uniformly accessible from databases, and there are also classical markers that are conserved across species, enabling a far more extensive range of prior information to be utilized in terms of the number of cell types and species to be analyzed.
One of the benefits of this method is its ability to infer additional topics other than guided topics, which may be useful in accounting for the impact of unknown cell types or removing confounding factors such as batch effects.In our analysis of liver samples from a drug-induced liver injury mouse model, we observed that the estimation of Kupffer cells was significantly improved by considering additional topics, whereas the benefits for neutrophils were minimal (Fig. 3D).For neutrophils, the estimated values were robust even without considering additional topics, suggesting that the variance in neutrophils was small and thus the impact of additional topics on the cell type was limited (Fig. S10A).We also found that the additional topics ref lected biological characteristics, such as enrichment in metabolic and biosynthetic pathways (Fig. S9D), which is consistent with the biological role of the liver.These topics likely ref lect the inf luence of the transcriptome derived from hepatocytes, which constitute the majority of liver tissue.
Applying GLDADec, we quantified the fraction of diverse cell types within tumor samples registered in the TCGA database.We focused on minor cell types that were not typically identified through conventional approaches.For instance, we revealed the significant association between the presence of alveolar type II cells and unfavorable clinical outcomes for patients with LUAD (Fig. 6F).These findings possess the potential to facilitate subsequent analyses and contribute novel perspectives.
It should be noted that the cell types covered by GLDADec are contingent upon the extent of knowledge of cell types and their marker genes that have been accumulated in the field of life sciences.In this study, we obtained marker gene names from domain knowledge or CellMarker database.However, the accuracy of marker gene selection may decrease when the labeling of the ground-truth is imprecise, as seen in the case of 'lymphocytes' in GSE60424.This may explain why GLDADec exhibited a poor MSE score in GSE60424.Furthermore, although the CellMarker database is well characterized, there is room for improvement in coverage, as eosinophils, a cell type that plays a protective role in acute liver injury, are not registered in liver tissue [51].Therefore, for a more comprehensive analysis, it may be necessary to integrate multiple databases that store marker genes to consolidate information.On the other hand, it's worth noting that the protein markers listed in these databases may not consistently serve as good transcriptional markers.Currently, data cleansing relies heavily on manual curation employing domain knowledge.Given the recent advancements in large language models, extracting marker gene information from scientific literature could prove beneficial in identifying appropriate marker gene names for deconvolution of tissues of interest.While social acceptance is crucial as knowledge, such knowledge miners could prove to be powerful partners of the proposed method.

Key Points
• We introduce GLDADec, a novel bulk deconvolution method leveraging marker gene names as partial prior information for estimating cell type proportions.• GLDADec adopts a semi-supervised learning algorithm, utilizing cell type marker genes to address challenges present in both conventional reference-based and reference-free methods simultaneously.• GLDADec demonstrates strong estimation performance and biological interpretability as evidenced by benchmarking across blood-derived and perturbed tissue datasets.• We utilized GLDADec on TCGA tumor samples, conducting cancer subtype stratification and survival analysis, showcasing its utility in clinical data analysis.

Figure 2 .
Figure 2. Benchmarking with human blood samples.(A) Scatterplots comparing estimated proportions and measured immune cell proportions in GSE65133.(B) Heatmaps showing Pearson correlation to compare the estimation performance against existing bulk reference-based methods using LM22 on three benchmark datasets, GSE65133, GSE107572, and GSE60424.(C) Heatmaps showing Pearson correlation to compare the proposed method to state-of-the-art methods on two benchmark datasets, SDY67 and GSE107011.The barplots on the right shows the performance of each method across all cell types.

Figure 3 .
Figure 3. Assessment of the usefulness of incorporating additional topics in the analysis of tissue data.(A) Heatmap showing the correlation between the reconstructed cell type proportion and the ground-truth proportion.The maximum value for each row is highlighted.(B) Gene set enrichment analysis (GSEA) was performed on the gene list ranked by contribution to the additional topic reconstructing a missing cell type.The shading band represents the degree of contribution of each gene to the additional topic.The bottom vertical lines represent the location of the marker genes.(C) Variance of estimates at median selection for each sample with and without additional topics.(D) Radar chart comparing the difference in estimated performance with and without additional topics.The axis values signify the Pearson correlation between estimated and measured proportion of each immune cells.

Figure 4 .
Figure 4. Prediction of immune cell trafficking in mouse liver tissue perturbed by acetaminophen (APAP).(A) Comprehensive cell type proportions estimation using marker gene names collected in data-driven manner.(B) Scatterplot showing the estimated proportions of immune cells versus measured values in the same dataset.The circles indicate the control samples, and the triangles indicate the samples treated with APAP.(C) Comparison of performance using Pearson correlation with other notable deconvolution methods.

Figure 5 .
Figure 5. Estimation on rat samples with poor marker information.(A) Scatterplots showing the estimated proportions of immune cells alongside the corresponding measured values within the same dataset.These graphics represent the outcomes of defining marker genes in three distinct scenarios: Classical marker genes, mouse marker genes obtained from CellMarker, and differentially expressed genes derived from cell type specific gene expression profiles in mice.The circular symbols denote control samples, while the triangular symbols represent samples that have been treated with acetaminophen (APAP).(B) Comparison of performance using Pearson correlation with other notable deconvolution methods.(C-D) common and different patterns of immune cell trafficking in the liver between mice and rats treated with APAP.

Figure 6 .
Figure 6.Application of GLDADec to tumor samples.(A) Stacked barplots showing the estimated comprehensive cell type proportions in each tumor samples derived from BRCA, LUAD, and LIHC.(B) Relative proportions of comprehensive cell types in different subtypes of BRCA.(C-D) boxplots showing specific accumulation and infiltration between each subtype of BRCA.(E-F) Kaplan-Meier plots displaying the relationship between survival and the infiltration of B cells in LUAD and LIHC, as well as alveolar cell type 2 in LUAD.The patients were divided into two groups based on the top and bottom 25 th percentiles of the target cells, and the results were analyzed using the log-rank test.the transparent colors in the plots represent 95% confidence bands, and the P-values were computed using the log-rank test.

Table 1 .
Gene ontology (GO) enrichment analysis for the added topic that reconstructing missing microglia.The top 10 significantly enriched GO terms are shown with Benjamini-Hochberg adjusted P-values role for liver tissue and ref lecting the inf luence of potential transcriptomes such as hepatocytes ( Fig.

Table 2 .
KEGG pathway enrichment analysis.Significantly enriched pathway names are shown with Benjamini-Hochberg adjusted P-values